Introduction to Open Data Science - Course Project

1st exercise

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Mon Nov 27 13:33:10 2023"

My experience with RStudio

I feel fairly comfortable working with RStudio, as it is one of the main tools I use daily. However, after browsing through “R for Health Data Science” I feel there are several ways to use R, and I tend to structure my code in a rather different fashion. Perhaps the most striking difference is the use of pipes. The example file uses pipes everywhere, whereas I tend to avoid them, as I feel they make the code harder to follow even if they shorten and simplify it. For example, if we create a small data frame with three columns

exampledata1=c(2,4,6,6,8,9,10)
exampledata2=c(70.2,44.0,62.7,106.1,81.1,96.1,10.8)
exampledata3=c(16,23,10,12,9,10,17)
df=data.frame(exampledata1,exampledata2,exampledata3)
print(df)
##   exampledata1 exampledata2 exampledata3
## 1            2         70.2           16
## 2            4         44.0           23
## 3            6         62.7           10
## 4            6        106.1           12
## 5            8         81.1            9
## 6            9         96.1           10
## 7           10         10.8           17

and we want to calculate the mean of the third column, I would use the following command:

mean(df$exampledata3)
## [1] 13.85714

whereas the exercise material would use the command

df$exampledata3 %>% mean()
## [1] 13.85714

Both ways lead to the same result, even though the syntax is different.
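As a side note, newer R versions also offer a third option. A minimal sketch, assuming R 4.1 or later where the native pipe |> is available:

```r
exampledata3 <- c(16, 23, 10, 12, 9, 10, 17)

# Direct call and native pipe give the same result
mean(exampledata3)      # 13.85714
exampledata3 |> mean()  # 13.85714
```

Unlike %>%, the native pipe needs no extra package.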

Another difference is plotting. The exercise uses ggplot2, whereas I am used to base R plots. I know ggplot2 is a more versatile and widely used tool; however, I find its logic confusing compared to base plots. I hope to learn to use ggplot2 during this course.
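To illustrate the difference in logic, here is a minimal sketch drawing the same scatterplot both ways, reusing two columns of the toy data frame defined above:

```r
library(ggplot2)

df <- data.frame(exampledata1 = c(2, 4, 6, 6, 8, 9, 10),
                 exampledata2 = c(70.2, 44.0, 62.7, 106.1, 81.1, 96.1, 10.8))

# Base R: a single function call draws the whole figure
plot(df$exampledata1, df$exampledata2)

# ggplot2: the figure is built in layers - data and aesthetics first,
# then the geometry is added with +
ggplot(df, aes(x = exampledata1, y = exampledata2)) +
  geom_point()
```

Both produce a plain scatterplot; the layered grammar pays off once plots get more complex.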

Overall I cannot say how well the exercise set worked as a crash course, as most of it was already familiar to me. However, the parts about ggplot2 were my favorite, as they helped me understand the logic behind the commands. R Markdown is not familiar to me - I have only used R scripts before.

Chapter 2: Linear regression

date()
## [1] "Mon Nov 27 13:33:10 2023"

Data

The first step is to read the data created during the data wrangling part. In this case it is stored on my computer.

learningdata=read.csv("C:/Users/03196349/Work Folders/PhD opinnot/IODS/IODS-project/data/learning2014.csv")
head(learningdata)
##   X gender Age Attitude     deep  stra     surf Points
## 1 1      F  53       37 3.583333 3.375 2.583333     25
## 2 2      M  55       31 2.916667 2.750 3.166667     12
## 3 3      F  49       25 3.500000 3.625 2.250000     24
## 4 4      M  53       35 3.500000 3.125 2.250000     10
## 5 5      M  49       37 3.666667 3.625 2.833333     22
## 6 6      F  38       38 4.750000 3.625 2.416667     21

The data appears to be in good order. After importing it, we can inspect its structure.

str(learningdata)
## 'data.frame':    166 obs. of  8 variables:
##  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...

There are 8 variables (X, gender, Age, Attitude, deep, stra, surf and Points), all related to learning. X is a number used to identify observations. It is irrelevant for the data exploration, as it is not a measurement - just a tool to identify different persons - and can therefore be excluded from the graphical investigation. A scatterplot matrix is a useful tool for understanding the relationships between variables. Gender can be treated as a dummy variable, as it is either male or female in this data. To clarify the visuals, we can use different colours for the two genders.

library(GGally)
## Warning: package 'GGally' was built under R version 4.2.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.3
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)
p <- ggpairs(learningdata[-1], mapping = aes(alpha=0.3, col=gender), lower = list(combo = wrap("facethist", bins = 20)))
p

Correlations between variables can be seen in the top-right part of the graph (in total and by gender). The bottom-left part of the graph shows scatterplots, and the diagonal shows the distributions of the observations. The top row contains boxplots of the effect of gender on the different variables, the left column histograms of each variable by gender, and the top-left corner a bar chart of the genders. There are several aspects we can observe from this one graph.

  • There are more females than males in the data set.
    • Most of the test subjects are young, about 20-25 years old.
      • Males are usually slightly older.
      • There is a lack of observations of middle-aged men.
    • On average, males have a higher attitude score than women.
    • There is little to no difference in deep learning between genders.
    • The strategic and surface learning scores are slightly higher for women.
    • On average there is no difference in points, but the top scores are slightly male dominated.
  • For the most part, the correlations between variables are low. However, there are exceptions:
    • Points and Attitude have a positive correlation (0.44).
    • Surface and deep learning have a negative correlation (-0.32).
      • For males the correlation is strong, below -0.6, whereas for women it is just -0.09.
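The correlations read off the ggpairs panel can also be computed directly. A small sketch, assuming learningdata is loaded as above:

```r
# Pairwise correlations of the numeric variables (assumes learningdata is loaded)
num_vars <- learningdata[, c("Attitude", "deep", "stra", "surf", "Points")]
round(cor(num_vars), 2)

# The surf-deep correlation split by gender
by(learningdata, learningdata$gender, function(d) cor(d$surf, d$deep))
```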

Modeling

To create a model, we should select variables that have a high correlation with the explained variable (Points in this case). However, if two variables have a high correlation with the explained variable but are also highly correlated with each other, they should not both be used. In this case Points correlates with Attitude. The other variables have quite low correlations, but stra and surf have some correlation with Points, so we add them to the model as well.

model = lm(Points ~ Attitude + stra +surf, data = learningdata)
summary(model)
## 
## Call:
## lm(formula = Points ~ Attitude + stra + surf, data = learningdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## Attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

According to the summary table, surf has a very large p-value, and the model can therefore be simplified.

model2 = lm(Points ~ Attitude + stra, data = learningdata)
summary(model2)
## 
## Call:
## lm(formula = Points ~ Attitude + stra, data = learningdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## Attitude     0.34658    0.05652   6.132 6.31e-09 ***
## stra         0.91365    0.53447   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

If we drop the surf variable from the model, we get a simpler model with a higher adjusted R-squared and a smaller p-value. Now, the instructions of this exercise advise that non-significant terms should be removed. However, the question of significance is not black and white; it is more about how much uncertainty we are ready to accept. In model2 the variable stra has a p-value of 0.089. If we use a 95% confidence level, the term is not significant (p > 0.05), but if we use 90% confidence, it is (p < 0.1). Luckily we have more tools to help decide whether or not we wish to simplify the model further.
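One way to make the threshold explicit is to pull the p-values out of the summary and compare them against the chosen significance level. A sketch, assuming model2 is fitted as above:

```r
# Extract the p-value column of the coefficient table (assumes model2 exists)
pvals <- summary(model2)$coefficients[, "Pr(>|t|)"]

pvals < 0.05  # stra is not significant at the 5 % level
pvals < 0.10  # ...but it is at the 10 % level
```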

model3 = lm(Points ~ Attitude, data = learningdata)
summary(model3)
## 
## Call:
## lm(formula = Points ~ Attitude, data = learningdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

If we simplify the model even further, we start to lose some of the validity of the model: the adjusted R-squared decreases (from 0.195 to 0.186). Therefore we might not want to keep only one explanatory variable, and we are happy with model2.
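The adjusted R-squared values can be collected side by side to make the comparison concrete. A sketch, assuming the three models fitted above:

```r
# Adjusted R^2 for each candidate model (assumes model, model2, model3 exist)
sapply(list(full = model, two = model2, one = model3),
       function(m) summary(m)$adj.r.squared)
```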

We can further compare the models using the Bayesian Information Criterion (BIC).

library(flexmix)
## Warning: package 'flexmix' was built under R version 4.2.3
## Loading required package: lattice
BIC(model)
## [1] 1046.051
BIC(model2)
## [1] 1041.486
BIC(model3)
## [1] 1039.323

According to BIC, the model with only Attitude as a predictor is the best, despite its lower R-squared.

We can use yet more tools to compare the models. The command “step” drops predictors one by one until the smallest AIC value is reached and further simplification would only increase the AIC.

step(model)
## Start:  AIC=557.4
## Points ~ Attitude + stra + surf
## 
##            Df Sum of Sq    RSS    AIC
## - surf      1     15.00 4559.4 555.95
## <none>                  4544.4 557.40
## - stra      1     69.61 4614.0 557.93
## - Attitude  1    980.95 5525.3 587.85
## 
## Step:  AIC=555.95
## Points ~ Attitude + stra
## 
##            Df Sum of Sq    RSS    AIC
## <none>                  4559.4 555.95
## - stra      1     81.74 4641.1 556.90
## - Attitude  1   1051.89 5611.3 588.41
## 
## Call:
## lm(formula = Points ~ Attitude + stra, data = learningdata)
## 
## Coefficients:
## (Intercept)     Attitude         stra  
##      8.9729       0.3466       0.9137

The smallest AIC score is achieved with Attitude and stra as predictors (= model2), suggesting it is the best model. It could probably be argued that model2 and model3 are both valid, but I would prefer model2. So let’s take a closer look at it.

summary(model2)
## 
## Call:
## lm(formula = Points ~ Attitude + stra, data = learningdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## Attitude     0.34658    0.05652   6.132 6.31e-09 ***
## stra         0.91365    0.53447   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

Model validation

According to the summary table, the model intercept is 8.97, the parameter for Attitude is 0.35 and for stra 0.91. So the formula for the model is \(Points = 8.97 + 0.35 \cdot Attitude + 0.91 \cdot stra\). This means that a one-unit increase in Attitude increases Points by 0.35, and a one-unit increase in stra increases Points by 0.91. If both Attitude and stra are 0, the model suggests that the test subject receives about 9 points.
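The fitted equation can be checked with predict(). A sketch with hypothetical input values (Attitude = 30 and stra = 3 are made up for illustration), assuming model2 from above:

```r
# Prediction for a hypothetical student (assumes model2 is fitted as above)
predict(model2, newdata = data.frame(Attitude = 30, stra = 3))
# By hand: 8.97 + 0.347 * 30 + 0.914 * 3 gives about 22.1 points
```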

R-squared describes how much of the variation of the dependent variable is explained by the independent variables. Multiple R-squared describes how much of the variation in the dependent variable is explained by the model variables: 1 would mean all variation is explained, and 0 that none is. The multiple \(R^2\) of this model is 0.20, suggesting that 20 % of the variation is explained by the independent variables. In addition, the summary table provides an adjusted \(R^2\). This adjustment makes it easier to compare models with different numbers of predictors, as a higher number of predictors tends to increase \(R^2\) even when the added predictors do not improve the quality of the model - each predictor will explain some percentage of the variance, even by coincidence.

plot(model2, which = 1)

The residuals-vs-fitted plot is used to check for non-linearity. Residuals describe the distance between an observation and the model. If the residuals show an increasing or decreasing trend, we have issues with the model, i.e. we should have used nonlinear regression instead of linear. We can also detect outliers using this plot. Here the average of the residuals stays close to zero throughout the graph and there are no clear patterns, so all is good. However, we can see some outliers in the data, visible in the lower middle part of the graph.

plot(model2, which = 2)

The normal Q-Q plot sorts observations by residuals into ascending order. The majority of observations lie where one would expect them to in normally distributed data; only the tails are a bit off from the expected values. However, given the limited sample size, I would not be too worried about that. As a conclusion, the residuals are, or are close to being, normally distributed.

plot(model2, which = 5)

The residuals-vs-leverage plot is used to find influential observations. High leverage means that changes in that observation would affect the model more than changes in observations with low leverage. In this case there seem to be no highly influential observations, as all of them are within Cook’s distance (we cannot even see the threshold at this zoom level). We can also see that the residuals do not really change as a function of leverage, indicating once again that the residuals are roughly normally distributed and that the variances of the observations show no patterns.
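For reference, the diagnostic plots examined above can also be drawn in a single 2x2 panel. A sketch, assuming model2 from above:

```r
# All four standard lm diagnostic plots in one 2x2 panel
# (assumes model2 is fitted as above)
par(mfrow = c(2, 2))
plot(model2)
par(mfrow = c(1, 1))  # restore the default single-plot layout
```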



Chapter 3: Logistic regression

Disclaimer! I have had a super busy week, and unfortunately I did not have enough time to really focus; I am coding this last minute before the deadline. So if and when I have made typos and silly mistakes, please be patient - I am sorry.

date()
## [1] "Mon Nov 27 13:33:19 2023"

Data

The first step is to read the data created during the data wrangling part. In this case it is stored on my computer.

alc=read.csv("C:/Users/03196349/Work Folders/PhD opinnot/IODS/IODS-project/data/alcdata.csv")
head(alc)
##   X school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob
## 1 1     GP   F  18       U     GT3       A    4    4  at_home  teacher
## 2 2     GP   F  17       U     GT3       T    1    1  at_home    other
## 3 3     GP   F  15       U     LE3       T    1    1  at_home    other
## 4 4     GP   F  15       U     GT3       T    4    2   health services
## 5 5     GP   F  16       U     GT3       T    3    3    other    other
## 6 6     GP   M  16       U     LE3       T    4    3 services    other
##       reason guardian traveltime studytime schoolsup famsup activities nursery
## 1     course   mother          2         2       yes     no         no     yes
## 2     course   father          1         2        no    yes         no      no
## 3      other   mother          1         2       yes     no         no     yes
## 4       home   mother          1         3        no    yes        yes     yes
## 5       home   father          1         2        no    yes         no     yes
## 6 reputation   mother          1         2        no    yes        yes     yes
##   higher internet romantic famrel freetime goout Dalc Walc health failures paid
## 1    yes       no       no      4        3     4    1    1      3        0   no
## 2    yes      yes       no      5        3     3    1    1      3        0   no
## 3    yes      yes       no      4        3     2    2    3      3        2  yes
## 4    yes      yes      yes      3        2     2    1    1      5        0  yes
## 5    yes       no       no      4        3     2    1    2      5        0  yes
## 6    yes      yes       no      5        4     2    1    2      5        0  yes
##   absences G1 G2 G3 alc_use high_use
## 1        5  2  8  8     1.0    FALSE
## 2        3  7  8  8     1.0    FALSE
## 3        8 10 10 11     2.5     TRUE
## 4        1 14 14 14     1.0    FALSE
## 5        2  8 12 12     1.5    FALSE
## 6        8 14 14 14     1.5    FALSE

Let’s remove the first column “X”, as it is a duplicate of the row numbers.

alc=alc[,-1]

Let’s take a look at our data.

colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

We have 35 columns describing information about the students. Some variables describe background information. Variables G1, G2 and G3 describe the grades of the test subjects. alc_use tells how large the alcohol consumption is, and high_use tells whether the consumption is high or not.


The 4 variables I would like to choose are absences, G3, Pstatus and age. I assume that high absences, low grades, a broken family and older age are linked to high consumption of alcohol.

First I would like to check how those variables are related to each other. But before that, let’s load some packages!

library(tidyr); library(dplyr); library(ggplot2)
## Warning: package 'tidyr' was built under R version 4.2.3
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
## 
##     extract
table(alc$Pstatus, alc$high_use)
##    
##     FALSE TRUE
##   A    26   12
##   T   233   99
g1 <- ggplot(alc, aes(x = high_use, y = age))
g2 <- ggplot(alc, aes(x = high_use, y = G3))
g3 <- ggplot(alc, aes(x = high_use, y = absences))


g1 + geom_boxplot()

g2 + geom_boxplot()

g3 + geom_boxplot()

Using the table and boxplots we can assume:

  • High consumption of alcohol is slightly more common in broken families.
  • Young people use less alcohol.
  • People with high grades have lower alcohol consumption.
  • People with high alcohol consumption are more often absent.

Therefore the hypotheses presented seem to be correct.
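The boxplot impressions can be backed up with group means. A sketch, assuming alc is loaded as above:

```r
# Group means of the numeric candidate variables by high_use
# (assumes alc is loaded as above)
aggregate(cbind(age, G3, absences) ~ high_use, data = alc, FUN = mean)
```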


m <- glm(high_use ~ G3 + absences + age + Pstatus, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ G3 + absences + age + Pstatus, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3352  -0.8239  -0.7050   1.2037   1.8781  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.52154    1.82438  -1.382 0.166931    
## G3          -0.06845    0.03602  -1.900 0.057371 .  
## absences     0.08199    0.02335   3.511 0.000446 ***
## age          0.12084    0.10147   1.191 0.233666    
## PstatusT     0.05517    0.39732   0.139 0.889572    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 428.47  on 365  degrees of freedom
## AIC: 438.47
## 
## Number of Fisher Scoring iterations: 4

According to the summary of the model, only G3 and absences are statistically significant. Family status has little to do with alcohol consumption, so a simpler model would be better. For that we can use the step command, which finds the model with the smallest (= best) AIC score.

step(m)
## Start:  AIC=438.47
## high_use ~ G3 + absences + age + Pstatus
## 
##            Df Deviance    AIC
## - Pstatus   1   428.49 436.49
## - age       1   429.89 437.89
## <none>          428.47 438.47
## - G3        1   432.09 440.09
## - absences  1   442.85 450.85
## 
## Step:  AIC=436.49
## high_use ~ G3 + absences + age
## 
##            Df Deviance    AIC
## - age       1   429.93 435.93
## <none>          428.49 436.49
## - G3        1   432.17 438.17
## - absences  1   443.07 449.07
## 
## Step:  AIC=435.93
## high_use ~ G3 + absences
## 
##            Df Deviance    AIC
## <none>          429.93 435.93
## - G3        1   434.52 438.52
## - absences  1   445.68 449.68
## 
## Call:  glm(formula = high_use ~ G3 + absences, family = "binomial", 
##     data = alc)
## 
## Coefficients:
## (Intercept)           G3     absences  
##    -0.38732     -0.07606      0.08423  
## 
## Degrees of Freedom: 369 Total (i.e. Null);  367 Residual
## Null Deviance:       452 
## Residual Deviance: 429.9     AIC: 435.9

In this case such a model has just two predictors: absences and G3. So let’s create that model.

m <- glm(high_use ~ G3 + absences, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ G3 + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3286  -0.8298  -0.7219   1.2113   1.9242  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.38732    0.43500  -0.890 0.373259    
## G3          -0.07606    0.03553  -2.141 0.032311 *  
## absences     0.08423    0.02309   3.648 0.000264 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 429.93  on 367  degrees of freedom
## AIC: 435.93
## 
## Number of Fisher Scoring iterations: 4

So according to the model, the estimate for parameter G3 is -0.08, the parameter for absences is 0.084, and the intercept is -0.39. To calculate odds ratios, we use the following command:

OR <- coef(m) %>% exp
OR
## (Intercept)          G3    absences 
##   0.6788751   0.9267616   1.0878762

The odds ratios describe the multiplicative change in the odds of high alcohol use when the explanatory variable increases by one unit: a ratio above 1 increases the odds, and a ratio below 1 decreases them.

CI=confint(m)
## Waiting for profiling to be done...
CI
##                   2.5 %       97.5 %
## (Intercept) -1.25260555  0.459482521
## G3          -0.14621889 -0.006471056
## absences     0.04110528  0.131708308

This provides the 95% confidence intervals for the coefficients (on the log-odds scale), 97.5% being the upper and 2.5% the lower limit.
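The odds ratios and their confidence intervals can be combined into one table; exponentiating the intervals moves them onto the odds-ratio scale. A sketch, assuming m is the fitted model above:

```r
# Odds ratios with 95 % confidence intervals on the same scale
# (assumes m is the fitted logistic model above)
cbind(OR = exp(coef(m)), exp(confint(m)))
# If the interval of a predictor does not include 1, its effect is
# significant at the 5 % level
```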

Hypothesis-wise this means that age and family status do not play an important role in whether one consumes high amounts of alcohol, and that high grades mean less alcohol while high absences are tied to high alcohol consumption. However, it is unclear which one is the cause and which is the effect.


alc = mutate(alc, probability = predict(m, type = "response"))
alc = mutate(alc, prediction =probability>0.5)
table(high_use = alc$high_use, prediction = alc$prediction)%>%prop.table()
##         prediction
## high_use      FALSE       TRUE
##    FALSE 0.68108108 0.01891892
##    TRUE  0.25675676 0.04324324

The model was good at predicting when high use is false. However, it often predicted low use even though consumption was actually high (about 26% of the observations).
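From the same table we can compute the overall accuracy and error rate directly. A sketch, assuming alc with the prediction column created above:

```r
# Overall accuracy from the confusion table (assumes alc has the
# prediction column created above)
tab <- table(high_use = alc$high_use, prediction = alc$prediction)
accuracy <- sum(diag(tab)) / sum(tab)
accuracy      # share of correct predictions, about 0.72 here
1 - accuracy  # total error rate, about 0.28
```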

g <- ggplot(alc, aes(x = probability, y = high_use, col=prediction))

# define the geom as points and draw the plot
g+geom_point()

The visual representation shows that the model often assumes high_use to be false even though it actually is true.

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(alc$high_use,alc$probability)
## [1] 0.2756757

The training error (the mean proportion of wrong predictions) is 0.28.


library(boot)
## Warning: package 'boot' was built under R version 4.2.3
## 
## Attaching package: 'boot'
## The following object is masked from 'package:flexmix':
## 
##     boot
## The following object is masked from 'package:lattice':
## 
##     melanoma
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2837838

The average share of wrong predictions in the cross-validation is about 0.28 (0.2838 on this run), which is greater than the error in the exercise set (0.26). This means our model is worse than the one in exercise 3. Let’s try to find a better model.

m2 <- glm(high_use ~ G3 + absences+ school + sex + address+ famsize + Pstatus + Medu + Fedu + Mjob + Fjob + reason + guardian + traveltime + studytime + schoolsup+ famsup+ activities + nursery + higher + internet + romantic + famrel + freetime + goout + health + failures + paid + absences + G1 + G2, data = alc, family = "binomial")
summary(m2)
## 
## Call:
## glm(formula = high_use ~ G3 + absences + school + sex + address + 
##     famsize + Pstatus + Medu + Fedu + Mjob + Fjob + reason + 
##     guardian + traveltime + studytime + schoolsup + famsup + 
##     activities + nursery + higher + internet + romantic + famrel + 
##     freetime + goout + health + failures + paid + absences + 
##     G1 + G2, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7620  -0.6544  -0.3590   0.4831   2.8990  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.10711    1.79919  -1.727 0.084177 .  
## G3                0.05976    0.11463   0.521 0.602121    
## absences          0.09105    0.02706   3.365 0.000766 ***
## schoolMS         -0.48058    0.53496  -0.898 0.369001    
## sexM              0.84885    0.33281   2.551 0.010755 *  
## addressU         -0.85479    0.41943  -2.038 0.041552 *  
## famsizeLE3        0.51405    0.33078   1.554 0.120176    
## PstatusT         -0.29228    0.53799  -0.543 0.586937    
## Medu              0.04021    0.22256   0.181 0.856645    
## Fedu              0.14434    0.19799   0.729 0.465991    
## Mjobhealth       -1.07147    0.76687  -1.397 0.162353    
## Mjobother        -0.85282    0.50337  -1.694 0.090222 .  
## Mjobservices     -0.82754    0.58036  -1.426 0.153896    
## Mjobteacher      -0.34813    0.70825  -0.492 0.623052    
## Fjobhealth        0.33260    1.19154   0.279 0.780142    
## Fjobother         0.81603    0.86311   0.945 0.344431    
## Fjobservices      1.19959    0.88620   1.354 0.175853    
## Fjobteacher      -0.34774    1.04522  -0.333 0.739367    
## reasonhome        0.55292    0.39470   1.401 0.161262    
## reasonother       1.52294    0.54867   2.776 0.005508 ** 
## reasonreputation  0.01215    0.41787   0.029 0.976799    
## guardianmother   -0.83744    0.37652  -2.224 0.026136 *  
## guardianother    -0.13975    0.80413  -0.174 0.862026    
## traveltime        0.32446    0.24022   1.351 0.176797    
## studytime        -0.27488    0.20833  -1.319 0.187012    
## schoolsupyes      0.03707    0.47710   0.078 0.938071    
## famsupyes        -0.07901    0.33491  -0.236 0.813504    
## activitiesyes    -0.58242    0.31635  -1.841 0.065608 .  
## nurseryyes       -0.48277    0.36927  -1.307 0.191090    
## higheryes        -0.30306    0.73540  -0.412 0.680263    
## internetyes       0.09333    0.45868   0.203 0.838771    
## romanticyes      -0.44975    0.33749  -1.333 0.182662    
## famrel           -0.57891    0.17076  -3.390 0.000698 ***
## freetime          0.29873    0.17003   1.757 0.078928 .  
## goout             0.90284    0.15767   5.726 1.03e-08 ***
## health            0.18851    0.11495   1.640 0.101042    
## failures          0.33167    0.27916   1.188 0.234790    
## paidyes           0.53225    0.32442   1.641 0.100883    
## G1               -0.13568    0.12864  -1.055 0.291563    
## G2                0.10089    0.15969   0.632 0.527532    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 310.30  on 330  degrees of freedom
## AIC: 390.3
## 
## Number of Fisher Scoring iterations: 5
step(m2)
## Start:  AIC=390.3
## high_use ~ G3 + absences + school + sex + address + famsize + 
##     Pstatus + Medu + Fedu + Mjob + Fjob + reason + guardian + 
##     traveltime + studytime + schoolsup + famsup + activities + 
##     nursery + higher + internet + romantic + famrel + freetime + 
##     goout + health + failures + paid + absences + G1 + G2
## 
##              Df Deviance    AIC
## - Mjob        4   314.58 386.58
## - schoolsup   1   310.30 388.30
## - Medu        1   310.33 388.33
## - internet    1   310.34 388.34
## - famsup      1   310.35 388.35
## - higher      1   310.47 388.47
## - G3          1   310.57 388.57
## - Pstatus     1   310.59 388.59
## - G2          1   310.70 388.70
## - Fedu        1   310.83 388.83
## - Fjob        4   317.06 389.06
## - school      1   311.12 389.12
## - G1          1   311.40 389.40
## - failures    1   311.73 389.73
## - nursery     1   311.99 389.99
## - studytime   1   312.08 390.08
## - romantic    1   312.11 390.11
## - traveltime  1   312.14 390.14
## <none>            310.30 390.30
## - famsize     1   312.70 390.70
## - paid        1   313.03 391.03
## - health      1   313.05 391.05
## - freetime    1   313.44 391.44
## - guardian    2   315.67 391.67
## - activities  1   313.75 391.75
## - address     1   314.45 392.45
## - reason      3   319.54 393.54
## - sex         1   316.96 394.96
## - absences    1   321.59 399.59
## - famrel      1   322.20 400.20
## - goout       1   350.56 428.56
## 
## Step:  AIC=386.58
## high_use ~ G3 + absences + school + sex + address + famsize + 
##     Pstatus + Medu + Fedu + Fjob + reason + guardian + traveltime + 
##     studytime + schoolsup + famsup + activities + nursery + higher + 
##     internet + romantic + famrel + freetime + goout + health + 
##     failures + paid + G1 + G2
## 
##              Df Deviance    AIC
## - Fjob        4   320.37 384.37
## - Medu        1   314.58 384.58
## - schoolsup   1   314.59 384.59
## - internet    1   314.63 384.63
## - Pstatus     1   314.67 384.67
## - famsup      1   314.72 384.72
## - higher      1   314.82 384.82
## - G2          1   314.84 384.84
## - G3          1   314.92 384.92
## - school      1   315.37 385.37
## - G1          1   315.71 385.71
## - studytime   1   315.74 385.74
## - Fedu        1   315.77 385.77
## - failures    1   316.08 386.08
## - romantic    1   316.33 386.33
## - health      1   316.39 386.39
## - nursery     1   316.44 386.44
## - traveltime  1   316.50 386.50
## <none>            314.58 386.58
## - guardian    2   318.64 386.64
## - famsize     1   317.00 387.00
## - freetime    1   317.40 387.40
## - activities  1   317.68 387.68
## - paid        1   318.01 388.01
## - reason      3   322.94 388.94
## - address     1   319.29 389.29
## - sex         1   321.42 391.42
## - famrel      1   326.22 396.22
## - absences    1   326.30 396.30
## - goout       1   353.06 423.06
## 
## Step:  AIC=384.37
## high_use ~ G3 + absences + school + sex + address + famsize + 
##     Pstatus + Medu + Fedu + reason + guardian + traveltime + 
##     studytime + schoolsup + famsup + activities + nursery + higher + 
##     internet + romantic + famrel + freetime + goout + health + 
##     failures + paid + G1 + G2
## 
##              Df Deviance    AIC
## - Pstatus     1   320.39 382.39
## - internet    1   320.40 382.40
## - schoolsup   1   320.41 382.41
## - Medu        1   320.44 382.44
## - higher      1   320.51 382.51
## - G3          1   320.61 382.61
## - famsup      1   320.72 382.72
## - G2          1   320.84 382.84
## - Fedu        1   320.90 382.90
## - school      1   321.08 383.08
## - studytime   1   321.52 383.52
## - traveltime  1   321.79 383.79
## - failures    1   321.83 383.83
## - nursery     1   322.04 384.04
## - G1          1   322.23 384.23
## - romantic    1   322.23 384.23
## <none>            320.37 384.37
## - health      1   322.38 384.38
## - freetime    1   322.53 384.53
## - famsize     1   322.93 384.93
## - activities  1   323.16 385.16
## - guardian    2   325.32 385.32
## - reason      3   327.69 385.69
## - paid        1   324.81 386.81
## - address     1   325.55 387.55
## - sex         1   327.22 389.22
## - famrel      1   330.47 392.47
## - absences    1   332.72 394.72
## - goout       1   360.29 422.29
## 
## Step:  AIC=382.39
## high_use ~ G3 + absences + school + sex + address + famsize + 
##     Medu + Fedu + reason + guardian + traveltime + studytime + 
##     schoolsup + famsup + activities + nursery + higher + internet + 
##     romantic + famrel + freetime + goout + health + failures + 
##     paid + G1 + G2
## 
##              Df Deviance    AIC
## - internet    1   320.42 380.42
## - schoolsup   1   320.43 380.43
## - Medu        1   320.45 380.45
## - higher      1   320.53 380.53
## - G3          1   320.65 380.65
## - famsup      1   320.75 380.75
## - G2          1   320.85 380.85
## - Fedu        1   320.93 380.93
## - school      1   321.12 381.12
## - studytime   1   321.53 381.53
## - traveltime  1   321.81 381.81
## - failures    1   321.86 381.86
## - nursery     1   322.05 382.05
## - romantic    1   322.25 382.25
## - G1          1   322.29 382.29
## - health      1   322.39 382.39
## <none>            320.39 382.39
## - freetime    1   322.53 382.53
## - famsize     1   323.01 383.01
## - activities  1   323.26 383.26
## - guardian    2   325.33 383.33
## - reason      3   327.71 383.71
## - paid        1   324.81 384.81
## - address     1   325.57 385.57
## - sex         1   327.24 387.24
## - famrel      1   330.51 390.51
## - absences    1   332.91 392.91
## - goout       1   360.30 420.30
## 
## Step:  AIC=380.42
## high_use ~ G3 + absences + school + sex + address + famsize + 
##     Medu + Fedu + reason + guardian + traveltime + studytime + 
##     schoolsup + famsup + activities + nursery + higher + romantic + 
##     famrel + freetime + goout + health + failures + paid + G1 + 
##     G2
## 
##              Df Deviance    AIC
## - schoolsup   1   320.45 378.45
## - Medu        1   320.47 378.47
## - higher      1   320.57 378.57
## - G3          1   320.67 378.67
## - famsup      1   320.77 378.77
## - G2          1   320.87 378.87
## - Fedu        1   320.95 378.95
## - school      1   321.13 379.13
## - studytime   1   321.55 379.55
## - traveltime  1   321.83 379.83
## - failures    1   321.87 379.87
## - nursery     1   322.13 380.13
## - romantic    1   322.25 380.25
## - G1          1   322.30 380.30
## <none>            320.42 380.42
## - health      1   322.42 380.42
## - freetime    1   322.59 380.59
## - famsize     1   323.06 381.06
## - activities  1   323.26 381.26
## - guardian    2   325.37 381.37
## - reason      3   327.73 381.73
## - paid        1   324.93 382.93
## - address     1   325.67 383.67
## - sex         1   327.31 385.31
## - famrel      1   330.52 388.52
## - absences    1   333.27 391.27
## - goout       1   360.46 418.46
## 
## Step:  AIC=378.45
## high_use ~ G3 + absences + school + sex + address + famsize + 
##     Medu + Fedu + reason + guardian + traveltime + studytime + 
##     famsup + activities + nursery + higher + romantic + famrel + 
##     freetime + goout + health + failures + paid + G1 + G2
## 
##              Df Deviance    AIC
## - Medu        1   320.50 376.50
## - higher      1   320.61 376.61
## - G3          1   320.69 376.69
## - famsup      1   320.80 376.80
## - G2          1   320.89 376.89
## - Fedu        1   320.97 376.97
## - school      1   321.13 377.13
## - studytime   1   321.60 377.60
## - traveltime  1   321.84 377.84
## - failures    1   321.90 377.90
## - nursery     1   322.16 378.16
## - romantic    1   322.27 378.27
## - G1          1   322.32 378.32
## <none>            320.45 378.45
## - health      1   322.49 378.49
## - freetime    1   322.61 378.61
## - famsize     1   323.10 379.10
## - activities  1   323.35 379.35
## - guardian    2   325.38 379.38
## - reason      3   327.75 379.75
## - paid        1   325.06 381.06
## - address     1   325.79 381.79
## - sex         1   327.67 383.67
## - famrel      1   330.54 386.54
## - absences    1   333.37 389.37
## - goout       1   360.66 416.66
## 
## Step:  AIC=376.5
## high_use ~ G3 + absences + school + sex + address + famsize + 
##     Fedu + reason + guardian + traveltime + studytime + famsup + 
##     activities + nursery + higher + romantic + famrel + freetime + 
##     goout + health + failures + paid + G1 + G2
## 
##              Df Deviance    AIC
## - higher      1   320.66 374.66
## - G3          1   320.75 374.75
## - famsup      1   320.87 374.87
## - G2          1   320.93 374.93
## - Fedu        1   321.03 375.03
## - school      1   321.18 375.18
## - studytime   1   321.65 375.65
## - traveltime  1   321.91 375.91
## - failures    1   322.01 376.01
## - nursery     1   322.28 376.28
## - romantic    1   322.35 376.35
## - G1          1   322.37 376.37
## <none>            320.50 376.50
## - health      1   322.54 376.54
## - freetime    1   322.65 376.65
## - famsize     1   323.18 377.18
## - activities  1   323.42 377.42
## - guardian    2   325.60 377.60
## - reason      3   327.81 377.81
## - paid        1   325.07 379.07
## - address     1   325.93 379.93
## - sex         1   327.67 381.67
## - famrel      1   330.59 384.59
## - absences    1   333.40 387.40
## - goout       1   360.66 414.66
## 
## Step:  AIC=374.66
## high_use ~ G3 + absences + school + sex + address + famsize + 
##     Fedu + reason + guardian + traveltime + studytime + famsup + 
##     activities + nursery + romantic + famrel + freetime + goout + 
##     health + failures + paid + G1 + G2
## 
##              Df Deviance    AIC
## - G3          1   320.88 372.88
## - famsup      1   321.06 373.06
## - G2          1   321.10 373.10
## - Fedu        1   321.14 373.14
## - school      1   321.38 373.38
## - studytime   1   321.86 373.86
## - traveltime  1   322.07 374.07
## - failures    1   322.33 374.33
## - romantic    1   322.38 374.38
## - nursery     1   322.40 374.40
## - G1          1   322.55 374.55
## <none>            320.66 374.66
## - health      1   322.67 374.67
## - freetime    1   322.79 374.79
## - famsize     1   323.32 375.32
## - activities  1   323.79 375.79
## - guardian    2   325.86 375.86
## - reason      3   327.93 375.93
## - paid        1   325.11 377.11
## - address     1   326.17 378.17
## - sex         1   328.20 380.20
## - famrel      1   330.73 382.73
## - absences    1   333.75 385.75
## - goout       1   360.88 412.88
## 
## Step:  AIC=372.88
## high_use ~ absences + school + sex + address + famsize + Fedu + 
##     reason + guardian + traveltime + studytime + famsup + activities + 
##     nursery + romantic + famrel + freetime + goout + health + 
##     failures + paid + G1 + G2
## 
##              Df Deviance    AIC
## - famsup      1   321.23 371.23
## - Fedu        1   321.32 371.32
## - school      1   321.66 371.66
## - studytime   1   322.19 372.19
## - traveltime  1   322.34 372.34
## - failures    1   322.42 372.42
## - G2          1   322.54 372.54
## - romantic    1   322.56 372.56
## - G1          1   322.60 372.60
## - nursery     1   322.63 372.63
## <none>            320.88 372.88
## - health      1   322.91 372.91
## - freetime    1   323.01 373.01
## - famsize     1   323.45 373.45
## - guardian    2   325.96 373.96
## - activities  1   323.98 373.98
## - reason      3   327.99 373.99
## - paid        1   325.33 375.33
## - address     1   326.35 376.35
## - sex         1   328.32 378.32
## - famrel      1   330.73 380.73
## - absences    1   334.31 384.31
## - goout       1   361.17 411.17
## 
## Step:  AIC=371.23
## high_use ~ absences + school + sex + address + famsize + Fedu + 
##     reason + guardian + traveltime + studytime + activities + 
##     nursery + romantic + famrel + freetime + goout + health + 
##     failures + paid + G1 + G2
## 
##              Df Deviance    AIC
## - Fedu        1   321.57 369.57
## - school      1   321.87 369.87
## - traveltime  1   322.66 370.66
## - studytime   1   322.74 370.74
## - failures    1   322.78 370.78
## - G2          1   322.87 370.87
## - G1          1   322.87 370.87
## - nursery     1   322.96 370.96
## - romantic    1   322.98 370.98
## - health      1   323.12 371.12
## <none>            321.23 371.23
## - freetime    1   323.23 371.23
## - famsize     1   323.87 371.87
## - guardian    2   326.22 372.22
## - activities  1   324.31 372.31
## - reason      3   328.42 372.42
## - paid        1   325.35 373.35
## - address     1   326.60 374.60
## - sex         1   329.13 377.13
## - famrel      1   330.86 378.86
## - absences    1   334.61 382.61
## - goout       1   361.79 409.79
## 
## Step:  AIC=369.57
## high_use ~ absences + school + sex + address + famsize + reason + 
##     guardian + traveltime + studytime + activities + nursery + 
##     romantic + famrel + freetime + goout + health + failures + 
##     paid + G1 + G2
## 
##              Df Deviance    AIC
## - school      1   322.18 368.18
## - traveltime  1   322.86 368.86
## - failures    1   322.87 368.87
## - nursery     1   323.07 369.07
## - G1          1   323.12 369.12
## - studytime   1   323.23 369.23
## - G2          1   323.23 369.23
## - romantic    1   323.28 369.28
## - freetime    1   323.44 369.44
## <none>            321.57 369.57
## - health      1   323.58 369.58
## - famsize     1   324.02 370.02
## - activities  1   324.48 370.48
## - reason      3   328.70 370.70
## - guardian    2   327.05 371.05
## - paid        1   325.82 371.82
## - address     1   326.98 372.98
## - sex         1   329.62 375.62
## - famrel      1   331.29 377.29
## - absences    1   335.46 381.46
## - goout       1   363.61 409.61
## 
## Step:  AIC=368.18
## high_use ~ absences + sex + address + famsize + reason + guardian + 
##     traveltime + studytime + activities + nursery + romantic + 
##     famrel + freetime + goout + health + failures + paid + G1 + 
##     G2
## 
##              Df Deviance    AIC
## - traveltime  1   323.22 367.22
## - G1          1   323.54 367.54
## - nursery     1   323.57 367.57
## - failures    1   323.64 367.64
## - G2          1   323.71 367.71
## - studytime   1   323.76 367.76
## - freetime    1   323.86 367.86
## - romantic    1   324.04 368.04
## <none>            322.18 368.18
## - health      1   324.53 368.53
## - famsize     1   324.55 368.55
## - reason      3   328.79 368.79
## - activities  1   324.93 368.93
## - guardian    2   327.39 369.39
## - paid        1   326.30 370.30
## - address     1   326.99 370.99
## - sex         1   330.30 374.30
## - famrel      1   331.74 375.74
## - absences    1   336.59 380.59
## - goout       1   364.02 408.02
## 
## Step:  AIC=367.22
## high_use ~ absences + sex + address + famsize + reason + guardian + 
##     studytime + activities + nursery + romantic + famrel + freetime + 
##     goout + health + failures + paid + G1 + G2
## 
##              Df Deviance    AIC
## - G1          1   324.57 366.57
## - G2          1   324.63 366.63
## - nursery     1   324.69 366.69
## - freetime    1   324.75 366.75
## - failures    1   324.81 366.81
## - studytime   1   324.99 366.99
## - romantic    1   325.10 367.10
## <none>            323.22 367.22
## - reason      3   329.53 367.53
## - health      1   325.57 367.57
## - famsize     1   325.97 367.97
## - activities  1   326.01 368.01
## - guardian    2   329.24 369.24
## - paid        1   327.40 369.40
## - address     1   330.64 372.64
## - sex         1   331.41 373.41
## - famrel      1   332.55 374.55
## - absences    1   337.62 379.62
## - goout       1   366.10 408.10
## 
## Step:  AIC=366.57
## high_use ~ absences + sex + address + famsize + reason + guardian + 
##     studytime + activities + nursery + romantic + famrel + freetime + 
##     goout + health + failures + paid + G2
## 
##              Df Deviance    AIC
## - G2          1   324.70 364.70
## - freetime    1   325.93 365.93
## - nursery     1   326.11 366.11
## - failures    1   326.24 366.24
## - studytime   1   326.54 366.54
## <none>            324.57 366.57
## - health      1   327.05 367.05
## - romantic    1   327.05 367.05
## - famsize     1   327.24 367.24
## - reason      3   331.26 367.26
## - activities  1   327.40 367.40
## - guardian    2   330.30 368.30
## - paid        1   329.01 369.01
## - address     1   331.70 371.70
## - sex         1   332.66 372.66
## - famrel      1   334.08 374.08
## - absences    1   338.85 378.85
## - goout       1   366.53 406.53
## 
## Step:  AIC=364.7
## high_use ~ absences + sex + address + famsize + reason + guardian + 
##     studytime + activities + nursery + romantic + famrel + freetime + 
##     goout + health + failures + paid
## 
##              Df Deviance    AIC
## - freetime    1   326.08 364.08
## - nursery     1   326.21 364.21
## - failures    1   326.24 364.24
## - studytime   1   326.57 364.57
## <none>            324.70 364.70
## - health      1   327.09 365.09
## - romantic    1   327.32 365.32
## - activities  1   327.43 365.43
## - reason      3   331.46 365.46
## - famsize     1   327.50 365.50
## - guardian    2   330.37 366.37
## - paid        1   329.18 367.18
## - address     1   331.74 369.74
## - sex         1   332.86 370.86
## - famrel      1   334.27 372.27
## - absences    1   338.85 376.85
## - goout       1   367.17 405.17
## 
## Step:  AIC=364.08
## high_use ~ absences + sex + address + famsize + reason + guardian + 
##     studytime + activities + nursery + romantic + famrel + goout + 
##     health + failures + paid
## 
##              Df Deviance    AIC
## - nursery     1   327.51 363.51
## - failures    1   327.64 363.64
## - studytime   1   328.04 364.04
## <none>            326.08 364.08
## - health      1   328.44 364.44
## - romantic    1   328.46 364.46
## - activities  1   328.54 364.54
## - reason      3   332.61 364.61
## - famsize     1   328.73 364.73
## - guardian    2   332.13 366.13
## - paid        1   330.49 366.49
## - address     1   332.70 368.70
## - famrel      1   335.05 371.05
## - sex         1   335.40 371.40
## - absences    1   339.96 375.96
## - goout       1   375.82 411.82
## 
## Step:  AIC=363.51
## high_use ~ absences + sex + address + famsize + reason + guardian + 
##     studytime + activities + romantic + famrel + goout + health + 
##     failures + paid
## 
##              Df Deviance    AIC
## - failures    1   329.01 363.01
## <none>            327.51 363.51
## - famsize     1   329.70 363.70
## - romantic    1   329.86 363.86
## - studytime   1   329.87 363.87
## - activities  1   329.88 363.88
## - reason      3   333.89 363.89
## - health      1   329.95 363.95
## - paid        1   331.64 365.64
## - guardian    2   334.22 366.22
## - address     1   334.73 368.73
## - famrel      1   336.36 370.36
## - sex         1   336.42 370.42
## - absences    1   340.99 374.99
## - goout       1   376.85 410.85
## 
## Step:  AIC=363.01
## high_use ~ absences + sex + address + famsize + reason + guardian + 
##     studytime + activities + romantic + famrel + goout + health + 
##     paid
## 
##              Df Deviance    AIC
## <none>            329.01 363.01
## - famsize     1   331.20 363.20
## - romantic    1   331.25 363.25
## - reason      3   335.60 363.60
## - activities  1   331.61 363.61
## - health      1   331.97 363.97
## - studytime   1   332.12 364.12
## - paid        1   332.59 364.59
## - guardian    2   335.95 365.95
## - address     1   336.70 368.70
## - sex         1   338.10 370.10
## - famrel      1   338.76 370.76
## - absences    1   343.00 375.00
## - goout       1   382.34 414.34
## 
## Call:  glm(formula = high_use ~ absences + sex + address + famsize + 
##     reason + guardian + studytime + activities + romantic + famrel + 
##     goout + health + paid, family = "binomial", data = alc)
## 
## Coefficients:
##      (Intercept)          absences              sexM          addressU  
##         -1.72681           0.09174           0.91034          -0.94771  
##       famsizeLE3        reasonhome       reasonother  reasonreputation  
##          0.45183           0.20790           1.02210          -0.27714  
##   guardianmother     guardianother         studytime     activitiesyes  
##         -0.75088           0.40855          -0.32528          -0.46361  
##      romanticyes            famrel             goout            health  
##         -0.45457          -0.48800           0.90858           0.17681  
##          paidyes  
##          0.54838  
## 
## Degrees of Freedom: 369 Total (i.e. Null);  353 Residual
## Null Deviance:       452 
## Residual Deviance: 329   AIC: 363
m3 <- glm(high_use ~ absences + sex + address + famsize + 
    reason + guardian + studytime + activities + romantic + famrel + 
    goout + health + paid, data = alc, family = "binomial")
summary(m3)
## 
## Call:
## glm(formula = high_use ~ absences + sex + address + famsize + 
##     reason + guardian + studytime + activities + romantic + famrel + 
##     goout + health + paid, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8674  -0.7171  -0.4029   0.5821   2.7665  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.72681    0.94633  -1.825 0.068041 .  
## absences          0.09174    0.02432   3.773 0.000161 ***
## sexM              0.91034    0.30568   2.978 0.002901 ** 
## addressU         -0.94771    0.34320  -2.761 0.005756 ** 
## famsizeLE3        0.45183    0.30426   1.485 0.137536    
## reasonhome        0.20790    0.36585   0.568 0.569858    
## reasonother       1.02210    0.48846   2.093 0.036393 *  
## reasonreputation -0.27714    0.37752  -0.734 0.462877    
## guardianmother   -0.75088    0.33290  -2.256 0.024096 *  
## guardianother     0.40855    0.73349   0.557 0.577533    
## studytime        -0.32528    0.18830  -1.727 0.084081 .  
## activitiesyes    -0.46361    0.28927  -1.603 0.109002    
## romanticyes      -0.45457    0.30736  -1.479 0.139160    
## famrel           -0.48800    0.15813  -3.086 0.002028 ** 
## goout             0.90858    0.13835   6.567 5.13e-11 ***
## health            0.17681    0.10399   1.700 0.089096 .  
## paidyes           0.54838    0.29252   1.875 0.060834 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 329.01  on 353  degrees of freedom
## AIC: 363.01
## 
## Number of Fisher Scoring iterations: 5
alc <- mutate(alc, probability = predict(m3, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(alc$high_use, alc$probability)
## [1] 0.2756757
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m3, K = 10)
cv$delta[1]
## [1] 0.2837838

I tried to create as good a model as possible by putting all variables into the model and using the step() function to find the model with the smallest AIC, ending up with model m3. However, the training error (about 0.276) and the 10-fold cross-validation error (about 0.284) are only slightly smaller than for the original model. In addition, I am coding this late at night with the deadline looming, so I must give up and declare that I cannot find a better model than the one in exercise 3.
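AIC-based selection and prediction error need not agree, so another way to compare candidate models is directly by their cross-validation error. A self-contained sketch of the idea on the built-in mtcars data (the model formulas here are only illustrative, not the course models):

```r
# Compare two logistic models by 10-fold cross-validation error instead of
# AIC alone, using the same classification loss as above.
library(boot)

loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

set.seed(42)  # reproducible fold assignment
full  <- glm(am ~ mpg + wt + hp, data = mtcars, family = "binomial")
small <- glm(am ~ wt,            data = mtcars, family = "binomial")

c(full  = cv.glm(mtcars, full,  cost = loss_func, K = 10)$delta[1],
  small = cv.glm(mtcars, small, cost = loss_func, K = 10)$delta[1])
```

The model with the lower cross-validation error generalizes better, regardless of which one the stepwise search preferred.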


Chapter 4: Clustering and classification

date()
## [1] "Mon Nov 27 13:33:24 2023"

Data

Let’s import the data. We will use the “Boston” data from the MASS package.

library(MASS)
## Warning: package 'MASS' was built under R version 4.2.3
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(ggplot2)
data("Boston")

Now we can explore how the data looks

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

Dataset “Boston” has 506 observations and 14 variables. Most of the variables are numeric, but two (chas and rad) are integers. The variables and their descriptions are presented below.

Variable Description
crim per capita crime rate by town
zn proportion of residential land zoned for lots over 25,000 sq. ft.
indus proportion of non-retail business acres per town
chas Charles River dummy variable (1 if tract bounds the river; 0 otherwise)
nox nitrogen oxides concentration (parts per 10 million)
rm average number of rooms per dwelling
age proportion of owner-occupied units built prior to 1940
dis weighted mean of distances to five Boston employment centres
rad index of accessibility to radial highways
tax full-value property-tax rate per $10,000
ptratio pupil-teacher ratio by town
black proportion of the Black population by town
lstat lower status of the population (percent)
medv median value of owner-occupied homes in $1000s

Summary of the data

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Here you can see the pairwise comparison of each variable. The matrix is fairly small, but it provides some visual insights.

One can see that some variable pairs show little relationship (such as age and ptratio) whereas others show a strong relationship (nox and dis). A correlation matrix could be easier to interpret.

## Warning: package 'corrplot' was built under R version 4.2.3
## corrplot 0.92 loaded

Scaling the data

boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

After scaling, the mean of each variable is 0 and its standard deviation is 1. So instead of absolute values, each value now tells how many standard deviations the observation lies from the mean of that variable.
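This can be verified directly: scale() subtracts each column mean and divides by the column standard deviation, so a quick self-contained check looks like this.

```r
# Verify what scale() does: every column of the scaled data has mean 0
# (up to floating-point error) and standard deviation exactly 1.
library(MASS)

boston_scaled <- as.data.frame(scale(Boston))
round(colMeans(boston_scaled), 10)   # all numerically zero
apply(boston_scaled, 2, sd)          # all equal to 1
```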

bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
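Because the breaks are quantiles, the four crime classes end up roughly equal in size; a self-contained check of this on the same Boston data:

```r
# Re-create the quantile-based crime categories and tabulate the class sizes;
# quantile breaks split the 506 towns into four roughly equal groups.
library(MASS)

crim_scaled <- as.numeric(scale(Boston$crim))
bins  <- quantile(crim_scaled)
crime <- cut(crim_scaled, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
table(crime)
```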

Divide data

n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
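One caveat: sample() is random, so the split above (and everything downstream of it) changes on every knit. Calling set.seed() first makes the analysis reproducible; a self-contained sketch (the seed value 2023 is arbitrary):

```r
# Reproducible 80/20 train/test split of the scaled Boston data.
library(MASS)

set.seed(2023)                        # fix the random number stream
boston_scaled <- as.data.frame(scale(Boston))
n     <- nrow(boston_scaled)
ind   <- sample(n, size = n * 0.8)    # 506 * 0.8 is truncated to 404 rows
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]
c(train = nrow(train), test = nrow(test))
```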

Linear discriminant analysis

lda.fit <- lda(crime ~ ., data = train)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 2)

Here we can see which variables separate the data into different groups. It seems that rad has the greatest effect on LD1, whereas nox has the greatest effect on LD2. If I interpret the graph correctly, access to radial highways is linked to a high crime rate and nitrogen oxide concentration to a medium-high rate.

correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       15      11        1    0
##   med_low    6      14        2    0
##   med_high   0       8       16    3
##   high       0       0        1   25

In most cases the model made correct predictions. Especially the success rate in predicting a high crime rate was good; the medium-low class seems to be the most difficult to predict. However, the predictions are not perfect: for example, in one case the model predicted a medium-high crime rate even though the actual rate was low.
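The overall accuracy can be read off the cross-tabulation above by summing its diagonal (the correctly classified observations) and dividing by the total; a self-contained check using the numbers from the table:

```r
# Overall accuracy from the confusion matrix shown above:
# the diagonal holds the correct predictions.
conf <- matrix(c(15, 11,  1,  0,
                  6, 14,  2,  0,
                  0,  8, 16,  3,
                  0,  0,  1, 25),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
sum(diag(conf)) / sum(conf)   # 70/102, about 0.69
```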

data(Boston)
scal_bos <- as.data.frame(scale(Boston))
dist_eu <- dist(scal_bos)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
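As a side note, dist() supports other metrics besides the default Euclidean one; for comparison, a self-contained sketch of the Manhattan distances between the same scaled observations:

```r
# Manhattan (city-block) distances between the scaled Boston observations;
# these are on a larger scale than the Euclidean distances above.
library(MASS)

scal_bos <- as.data.frame(scale(Boston))
summary(dist(scal_bos, method = "manhattan"))
```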

K-means

Let’s calculate k-means

km <- kmeans(scal_bos, centers = 5)
summary(km)
##              Length Class  Mode   
## cluster      506    -none- numeric
## centers       70    -none- numeric
## totss          1    -none- numeric
## withinss       5    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           5    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric

However, the clustering above would be more useful if we first found the optimal number of clusters. So let us do that!

k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(scal_bos, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
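Since qplot() is deprecated as of ggplot2 3.4.0 (hence the warning above), the same elbow plot can be drawn with ggplot() directly; a self-contained sketch on the scaled Boston data:

```r
# Total within-cluster sum of squares for k = 1..10, plotted with ggplot()
# instead of the deprecated qplot().
library(MASS)
library(ggplot2)

set.seed(1)  # kmeans uses random starting centers
scal_bos <- as.data.frame(scale(Boston))
k_max <- 10
twcss <- sapply(1:k_max, function(k) kmeans(scal_bos, k)$tot.withinss)

ggplot(data.frame(k = 1:k_max, twcss = twcss), aes(x = k, y = twcss)) +
  geom_line()
```

The "elbow" where the curve stops dropping sharply suggests a reasonable number of clusters.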

km <- kmeans(scal_bos, centers = 2)

pairs(scal_bos, col = km$cluster)

It is hard to see what is going on, so let’s take a closer look at some interesting-looking variables.

pairs(scal_bos[c(1,10)], col = km$cluster)

pairs(scal_bos[c(1,7)], col = km$cluster)

For example, in these two plots one can see clear clustering with crime and tax. High tax and a high crime rate seem to form two different clusters, which suggests that there is less crime in high-tax areas.

It also seems that the crime rate increases with the proportion of owner-occupied units built prior to 1940.


Bonus!

km <- kmeans(scal_bos, centers = 4)
lda.fit <- lda(km$cluster ~ ., data = scal_bos)

# reuse the lda.arrows() helper defined above

# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2, col = km$cluster, pch = km$cluster)
lda.arrows(lda.fit, myscale = 2.5)

The most influential variable on LD2 is black. On LD1 it is hard to determine which variable dominates, but it might be nox.

Super bonus!

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
model_predictors <- dplyr::select(train, -crime)
lda.fit <- lda(crime ~ ., data = train)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)


plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color= train$crime)
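For comparison, the same 3D projection could be coloured by k-means clusters instead of the crime classes, to see how well the two groupings agree. A self-contained sketch (it re-runs the whole pipeline, so it assumes only the MASS and plotly packages; results vary with the seed):

```r
# Colour the 3D LDA projection of the training data by 4-cluster k-means
# assignments instead of the crime classes.
library(MASS)
library(plotly)

set.seed(2023)
boston_scaled <- as.data.frame(scale(Boston))
bins <- quantile(boston_scaled$crim)
boston_scaled$crime <- cut(boston_scaled$crim, breaks = bins,
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL

ind   <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[ind, ]
model_predictors <- train[, setdiff(names(train), "crime")]

lda.fit <- lda(crime ~ ., data = train)
matrix_product <- as.data.frame(as.matrix(model_predictors) %*% lda.fit$scaling)

km_train <- kmeans(model_predictors, centers = 4)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = "scatter3d", mode = "markers",
        color = factor(km_train$cluster))
```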
